When a site becomes unavailable because of a physical access limitation
or a disaster such as a fire or earthquake, steps must be taken to
recover the Exchange server in that site. Exchange Server does not have
a single-step method of merging information from the failed site server
into another server, so the process involves recovering the lost server
in its entirety.
To prepare for the recovery of a failed site, an organization can
create redundancy in a failover site. With redundancy built into a
remote site, the recovery and restore effort is kept to a minimum if a
recovery becomes necessary.
For environments in
which SLAs offer little time to bring up a recovery location,
administrators should strongly consider implementing Database
Availability Groups, a new feature of Exchange Server 2010 that replaces
CCR and SCR.
Creating Redundant and Failover Sites
Redundant sites are created
for a couple of different reasons. First, a redundant site can have a
secondary Internet connection and bridgehead routing server so that if
the primary site is down, the secondary site can be the focus for
inbound and outbound email communications. This redundancy can be built,
configured, and set to automatically provide failover in case of a site
failure.
The other reason for a redundant site is to provide geographic failover
to allow for transparent disaster recovery. In Exchange Server 2010,
although you could build a “warm standby” site in which you would
install Exchange Server 2010 only when needed for recovery, doing so
offers no benefit over building a redundant site that already holds a
replicated copy of the mailbox data. This is exactly what Database
Availability Groups provide when placed in a site that also has the
Client Access Server and Hub Transport Server roles available.
If you plan to utilize
redundant DR sites, be sure to update those sites with patches and
applications as you apply them to the production systems. This ensures
that the remote replicas are usable should you have a failure in the
primary location.
Creating the Failover Site
When an organization
decides to plan for site failures as part of a disaster recovery
solution, many areas need to be addressed and many options exist. For
organizations looking for redundancy, network connectivity is a
priority, along with spare servers that can accommodate the user load.
The spare servers need to have enough disk space to accommodate a
complete restore. As a best practice, to ensure a smooth transition, the
following list of recommendations provides a starting point:
Allocate the appropriate hardware devices, including servers with enough processing power and disk space to accommodate the restored machines’ resources.
Host the organization’s external DNS zones and records using primary DNS servers located at an Internet service provider (ISP) collocation facility, or have redundant DNS servers registered for the domain and located at both physical locations.
Publish the recovery site’s IP address as a lower-priority MX record. This way, when the recovery server comes online, you won’t have to wait for DNS propagation to advertise the new MX record.
Ensure that network connectivity is already established and stable between sites and between each site and the Internet.
Create more than one copy of the backup tape media for each site. One copy should remain at the site, and a second copy should be stored with an offsite data storage company. This is necessary only if recovery of mailbox data beyond the internal retention policies is needed.
Have a copy of all disaster recovery documentation stored at multiple locations and at the offsite data storage company. This provides redundancy if a recovery becomes necessary.
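To illustrate why a pre-published, lower-priority MX record avoids waiting on DNS propagation, the following minimal Python sketch models how a sending mail server works through MX records. It assumes the usual convention that a lower preference number is tried first, so the recovery site carries the higher number. The host names, preference values, and the delivery_order helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MxRecord:
    preference: int  # lower value is tried first by sending servers
    host: str


def delivery_order(records, reachable_hosts):
    """Return the reachable mail hosts in the order a sender would try them."""
    ordered = sorted(records, key=lambda record: record.preference)
    return [record.host for record in ordered if record.host in reachable_hosts]


# Hypothetical records: primary site at preference 10, recovery site at 20.
records = [
    MxRecord(20, "mail-dr.example.com"),
    MxRecord(10, "mail.example.com"),
]

# Normal operation: the primary site is tried first.
print(delivery_order(records, {"mail.example.com", "mail-dr.example.com"}))

# Primary site down: mail flows to the recovery site with no DNS change.
print(delivery_order(records, {"mail-dr.example.com"}))
```

Because both records are already published, the only thing that changes during a site failure is which host answers.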
When the systems are in place in the failover site and configured to
support a Database Availability Group, the data is automatically
replicated from the master copy and is available when needed. Be sure
to account for the amount of replication traffic that will be passed
over the WAN to the disaster recovery site. Although the log files are
compressed, they are still potentially a large source of data. To get
an idea of the amount of data that will be replicated, look at the
volume of log files generated on the primary server each day. That is
the amount of data that will be replicated to each replica. For sites
running multiple replicas across WAN connections, this can be a
significant volume of data.
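The estimate described above can be turned into a rough calculation. The following Python sketch is a back-of-the-envelope aid, not a sizing tool: it assumes 1 MB transaction log files (the Exchange Server 2010 log file size), ignores any savings from compression, and reports only a 24-hour average, so peak-hour traffic will be higher. The function names and the sample figures are hypothetical.

```python
LOG_FILE_MB = 1            # assumed size of one transaction log file, in MB
SECONDS_PER_DAY = 24 * 60 * 60


def daily_replication_mb(log_files_per_day, remote_copies):
    """Uncompressed MB sent over the WAN per day, one stream per remote copy."""
    return log_files_per_day * LOG_FILE_MB * remote_copies


def average_mbps(mb_per_day):
    """Average megabits per second needed to move mb_per_day in 24 hours."""
    return mb_per_day * 8 / SECONDS_PER_DAY


# Hypothetical example: 20,000 log files per day, two copies in the DR site.
total_mb = daily_replication_mb(log_files_per_day=20_000, remote_copies=2)
print(f"{total_mb} MB per day, about {average_mbps(total_mb):.2f} Mbps on average")
```

Counting the log files written on the primary server over a typical business day gives the first input; the number of copies hosted across the WAN gives the second.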
Failing Over Between Sites
When utilizing
Database Availability Groups with replicas in a failover site that also
has CAS and HT roles available, the process of failing services from the
primary site to the DR site is easy:
1. Launch Exchange Management Console.
2. Expand Organization Configuration.
3. Click Mailbox.
4. Click the Database Management tab.
5. Right-click the database copy you’d like to activate.
6. Select Activate Database Copy.
7. When the wizard launches, if desired, enter an override mount dial for the operation; click OK.
8. When the wizard is completed, click Finish.
The same process can also be performed entirely from the Exchange Management Shell by following these steps:
1. Launch Exchange Management Shell.
2. Type Move-ActiveMailboxDatabase -Identity DBName -ActivateOnServer NewServer. For example:
Move-ActiveMailboxDatabase -Identity 'Mailbox Database 2010A' -ActivateOnServer E2010
If, on the other hand, the failover isn’t a planned event, the mailbox
databases within a DAG automatically fail over to the site holding the
second-highest-priority copy of the mailbox database. The preceding
steps are primarily used for DR testing or for moving services so that
systems can be patched or upgraded.
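The priority-based behavior described above can be modeled with a short Python sketch. This is a deliberately simplified illustration, not Exchange's selection logic: it picks the available copy with the best priority number and ignores the other factors the product weighs, such as copy health and replication queue lengths. The DatabaseCopy type, the next_copy_to_activate helper, and the server names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatabaseCopy:
    server: str
    activation_preference: int  # 1 is the most preferred copy
    available: bool


def next_copy_to_activate(
    copies: List[DatabaseCopy], failed_server: str
) -> Optional[DatabaseCopy]:
    """Pick the most preferred available copy that is not on the failed server."""
    candidates = [
        copy for copy in copies
        if copy.available and copy.server != failed_server
    ]
    # min() with default=None returns None when no copy can take over.
    return min(candidates, key=lambda copy: copy.activation_preference, default=None)


# Hypothetical DAG: ServerA holds the active copy and has just failed.
copies = [
    DatabaseCopy("ServerA", 1, available=False),
    DatabaseCopy("ServerB", 2, available=True),
    DatabaseCopy("ServerC", 3, available=True),
]

chosen = next_copy_to_activate(copies, failed_server="ServerA")
print(chosen.server if chosen else "no copy available")
```

When planning a DR site, the practical point is that the priority assigned to each copy determines where services land during an unplanned outage.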
Failing Back After Site Recovery
When the initial site is back online and available to handle client
requests and provide access to data and networking services and
applications, it is time to consider failing back the services. This
process is greatly improved in Exchange Server 2010 through the use of
Database Availability Groups. Unlike SCR, which was used for DR in
Exchange Server 2007, there is no need to reestablish the replication
relationship. A DAG simply continues to replicate mailbox data to all
other replicas. This means that if mailbox master status is moved from
ServerA to ServerB, ServerB will replicate to ServerA. If, on the other
hand, ServerA were unavailable for an extended period of time, fell too
far out of sync with ServerB, and needed to be reseeded, Exchange Server
2010 supports the concept of incremental reseeding; the amount of data
that would need to be sent back to ServerA would be significantly less
than it would have been in Exchange Server 2007 with SCR.
Questions to consider for failing back are as follows:
Will downtime be necessary to restore databases between the sites?
When is the appropriate time to fail back?
Is the failover site less functional than the preferred site? In other words, are only mission-critical services provided in the failover site, or is it a complete copy of the preferred site?
The answers lie in the
complexity of the failed-over environment. If the cutover is simple,
there is no reason to wait to fail back.
Providing Alternative Methods of Client Connectivity
When failover sites are too expensive to be an option, an organization
can still plan for site failures. Other lower-cost options are
available but depend on how and where the employees do their work. For
example, users who need to access email can often do so without being
physically present at the site. Email can be accessed remotely from
other terminals or workstations.
The following are some ways to deal with these issues without renting or buying a separate failover site:
Consider renting racks or cages at a local ISP to co-locate servers that can be accessed during a site failure.
Have users dial in from home to a terminal server hosted at an ISP to access Exchange Server.
Set up remote user access using Terminal Services or Outlook Web App at a redundant site so that users can access their email, calendar, and contacts from any location.
Configure Outlook to utilize Outlook Anywhere on “slow” connections. This enables users to connect normally while in the office and to connect over “public” connections should the office be unavailable.
Rent temporary office space, printers, networking equipment, and user workstations with common standard software packages such as Microsoft Office and Microsoft Internet Explorer. You can plan for and execute this option in about one day. If this is an option, be sure to find a computer rental agency and get pricing before a failure occurs and you have no choice but to pay whatever rental rates are offered.